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Speciation is fundamental to the huge diversity of life on Earth. Evidence suggests reproductive 
isolation arises most commonly in allopatry with a higher speciation rate in small populations. 
Current theory does not address this dependence in the important weak mutation regime. Here, 
we examine a biophysical model of speciation based on the binding of a protein transcription factor 
to a DNA binding site, and how their independent co-evolution, in a stabilizing landscape, of two 
allopatric lineages leads to incompatibilities. Our results give a new prediction for the monomorphic 
regime of evolution, consistent with data, that smaller populations should develop incompatibilities 
more quickly. This arises as: 1) smaller populations having a greater initial drift load, as there are 
more sequences that bind poorly than well, so fewer substitutions are needed to reach incompatible 
regions of phenotype space; 2) slower divergence when the population size is larger than the inverse 
of discrete differences in fitness. Further, we find longer sequences develop incompatibilities more 
quickly at small population sizes, but more slowly at large population sizes. The biophysical model 
thus represents a robust mechanism of rapid reproductive isolation for small populations and large 
sequences, that does not require peak-shifts or positive selection. 


INTRODUCTION 


Speciation is of great importance in generating the ob¬ 
served diversity of life, yet it is still poo^ understood, 
especially at the genetic level. Darwin [l|, despite the 
title of his magnum opus, struggled with the following 
problem, here phrased in a modern context: if hybrid 
inviability were due to heterozygote disadvantage at a 
single locus with alleles a and A, it is difficult to see how 
two species could have evolved from a single homozygotic 
ancestor without going through the inviable heterozy- 
gotic state. A resolution of this paradox was to propose 
that non-linear interactions or epistasis between different 
loci can give rise to so-called Dobzhansky-Muller incom¬ 
patibilities (DMI) d-Q; for example, two geographically 
isolated lineages evolving allopatrically from a common 
ancestor ab can fix the allelic combinations aB and Ab 
respectively, yet the hybrid genotype AB can be invi¬ 
able. It has also been recognised that a different sort 
of hybrid incompatibility can arise in polygenic systems, 
where many loci code for an additive quantitative trait. 
Quadratic selection induces epistasis such that divergent 
populations, under the action of drift, maintain different 
underlying allelic combinations at the many loci d, Q for 
the same optimal trait value, which when combined in hy¬ 
brids can lead to incompatibilities Q- Field data [sl-HOji 
suggest that the dominant mechanism for the evolution 
of (postzygotic) reproductive isolation (RI) involves the 
accumulation of Dobzhansky-Muller incompatibilities in 


geographically isolated populations with no or little gene 

flow [Th.[T^. 

Despite many studies of the evolution of RI, very little 
attention has been paid to the role of population size; 
however, there is indirect evidence that smaller popula¬ 
tions develop incompatibilities more quickly. In particu¬ 
lar, observations of the large species diversity in small 
habitats 1^ isl. such as cichlids in the East African 
Great Lakes fig, c ontrasted to the lower diversity of 
marine animals |l4 IT^ . [l8| and birds [l^, which have 
large ranges and population sizes, suggest that the rate at 
which RI develops increases with decreasing populations 
size. More directly, cichlids, whose effective population 
size is of order 100 — 10000 [13, IHj, develop reproduc¬ 
tive isolation on a timescale of 1 — lOMyr after diver¬ 
gence [i^, whilst domestic chickens {Callus gallus) can 
still hybridize with helmeted guineafowl {Numida melea- 
gris), even after roughly 55Myr divergence where es¬ 
timates of the effective population size of domestic chick¬ 
ens range between Nf, 10® to 10® [1^. This trend is 
further supported Q by inference of net diversification 
rates from phylogenetic trees 

Where does current theory stand in light of these ob¬ 
servations? There are a number of theoretical models 
of allopatric speciation based on the Dobzhansky-Muller 
mechanism, which consider independent lineages evolv¬ 
ing neutrallyor under varying selection pressures on each 


lineage [27H3.‘lj . Models which involve positive selection 


driving divergence are unlikely to be able to explain this 
dependence on population size, since larger populations 
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respond more quickly, for a given selection pressure [3lj |. 
This leaves models of speciation where populations di¬ 
verge neutrally or under similar stabilizing selection pres¬ 
sure; the models of Nei et. al and Gavrilets tackle 
precisely this question in the strong mutation regime 
{nfXoN > 1, where n is the number of nucleotides or 
base-pairs for the loci of interest, the base-pair muta¬ 
tion rate and N the population size). They find slower 
divergence in larger populations due to the lower mating 
success of members of the population who have diverged 
an amount comparable to the width of the fitness peak, 
resulting in a slower rate of developing RL However, in 
neither of these models is there a dependence on pop¬ 
ulation size in the weak mutation, nearly monomorphic 
regime, where nfioN <C 1. Models of hybrid incompati¬ 
bility that rely on fitness epistasis on quantitative traits, 
also predict that smaller populations should develop re¬ 
productive isolation more quickly, as drift helps popula¬ 
tions shift between stable equilibria more rapidly Q ; but 
again by it’s polygenic nature, we expect these results to 
be only relevant to the strong mutation regime. Johnson 
and Porter [3, [s^ examined the evolution of decreased 
hybrid fitness for simple models of gene regulation, under 
positive and stabilizing selection, in the clonal interfer¬ 
ence regime {nfioN ~ 1), but did not investigate the 
dependence on population size. More recently, they have 
extended their work with sequence based models [s^, 
showing decreased hybrid fitness with decreasing popu¬ 
lation size, however, these results are again in the regime 
where the effect of mutations are not weak (n/iplV ~ 1) 
and they did not investigate in detail the dynamics of 
the growth of DMIs. A model which could give rise to 
more rapid RI for small populations is based on founder 
events or peak-shifts, where small founder populations 
split and become isolated here genetic drift al¬ 

lows smaller populations to pass more easily through a 
fitness valley. A major problem with such models is that 
for isolation to occur on reasonable timescales the prod¬ 
uct of the fitness barrier and population size needs to 
be sufficiently small. However, this condition also means 
that gene flow is relatively unimpeded between peaks Q , 
destroying the reproductive isolation the model seeks to 
establish. Finally, the work of Orr and co-workers, pro¬ 
vided a framework to understand how incompatibilities 
might arise in allopatry through sequentially fixing muta- 
tions in the weak mutation regime (n/rpA^ <C 1) [23, [H; 
they showed that the number of potential or untested 
incompatibilities “snowballs” like ~ for interactions 
between pairs of loci. However, the starting point of 
this model is the assumption of neutral, population-size 
independent, divergence between lineages with a fixed 
probability that each untested combination is incompat¬ 
ible, and so cannot address the question of the popula¬ 
tion size dependence. In summary, although the mod¬ 
els of Gavrilets, Nei and Barton each predict a decreas¬ 
ing rate of developing RI with increasing population size 


when nfioN > 1, these models predict no dependence on 
population size, or are not applicable in the weak mu¬ 
tation, nearly monomorphic regime where n/rp7V <C 1. 
This is despite genetic studies which have shown that 
traits involved in species differences range from mono¬ 
genic through to mildly polygenic [131 . For traits which 
are not very polygenic, {nfioN <C 1), we still lack an un¬ 
derstanding of the effect of population size on the rate at 
which RI arises. 

In this paper, we examine how incompatibilities arise 
in allopatry for an abstract, yet biophysically motivated 
model of binding between two macromolecules, a protein 
transcription factor (TF) binding to a specific DNA or 
TF binding site (TFBS). Our model is based on the “two- 
state” approximation [4^ [43| , which although not cap¬ 
turing the molecular interactions in atomistic detail, can 
represent many salient aspects which have been ignored 
in previous work on speciation theory. In particular, such 
a model allows us to include the effects of drift-selection 
balance, due to some phenotypes being coded by more 
sequences than others, and the corresponding effect of 
population size on the speciation dynamics in the weak 
mutation regime {n^aN 1). Recent work has shown 
that such mappings from genotype to phe notype give rise 
to a number of non-trivial effects |44l449l | . Here, we find 
this simple genotype-phenotype map predicts an increas¬ 
ing rate of accumulating DMIs for decreasing population 
sizes in the weak mutation regime, the appropriate limit 
for monomorphically evolving traits, with a robust mech¬ 
anism that does not require valley crossing by either of 
the divergent populations. This dependence on popu¬ 
lation size arises due to the fact that there are many 
more sequences that give weaker binding than good, so 
the common ancestor of smaller populations, which are 
dominated by genetic drift, are on average closer to the 
inviability boundary. 

Gene expression divergence has been shown to be a 
major factor in driving differences between species [ 13 - 
1^ . providing indirect evidence of a role in speciation. 
In particular, compensatory changes at both cis and 
trans locations has been shown to be responsible for 
the misexpression of many genes in hybrids between 
D.melanogaster and D.simulans [b^ . as well as more 
direct evidence of speciation driven by the evolution 
of gene s related to transcription factors in Drosophila 
M,l^. With the increasing use of genome level studies 
10j| to study the process of speciation, there is a need for 
theory and modelling to bridge the gap between sequence 
level changes at co-evolving loci and phenotypic determi¬ 
nants of incompatibilities; the binding of transcription 
factors to DNA to control gene expression is arguably 
one of the most important co-evolving systems for or¬ 
ganisms and so makes an ideal case study to examine the 
consequences to speciation of a simple biophysical model 
and a first mechanistic insight on the way DMIs develop. 

The paper is organized as follows: We first introduce 
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a biophysical model of a transcription factor binding to 
DNA and the population genetic model of their evolu¬ 
tion. We then consider two populations evolving in¬ 
dependently, and consider the viability of reproductive 
crosses between these populations. 


METHODS 

Quaternary Model of Transcription Factor-DNA 
Binding 


The two-state approximation |42l . l43l | for transcription 
factor binding assumes that amino acid nucleotide inter¬ 
actions are either optimal or non-optimal and the con¬ 
tribution of each to the total binding energy is approxi¬ 
mately additive. The rationale for this model is the un¬ 
derlying biophysics of protein-DNA interactions, in par¬ 
ticular, the fact that an amino acid at a protein DNA 
interface will tend to have a preferred nucleotide with 
which to hydrogen bond, taking account the approxi¬ 
mately fixed orientation of the amino acid as positioned 
by the rest of the protein. The other nucleotides tend 
to be non-optimal and not able to hydrogen bond [s^ . 
Although each optimal interaction is marginally stabiliz¬ 
ing (—0.5kcal/mol [i^), it is the non-optimal nucleotides 
that dominate the binding free energy, since they can 
neither hydrogen bond to an amino acid nor to water 
molecules. Although this suggests a large cost for each 
non-optimal interaction, in reality this is highly depen¬ 
dent on the particular protein and DNA sequence; the 
cost in free energy per amino acid nucleotide mismatch 
can range from 1-2 kcal/mol (2-3kBT) [HI, to 4-5 
kcal/mol {d-SksT) [42, This is likely explained 

by specific co-operative effects that include electrostatic, 
steric and solvent interactions 5^, 6^ that change the en¬ 
ergy scale of binding dependent on a particular protein- 
DNA binding context. In this paper, for simplicity, we 
assume that e = SfcsT. 

As mentioned, for each amino acid there tends to a 
single nucleotide it prefers to hydrogen bond [s^l- If we 
designate the category of amino acids by its preferred 
partnering base (e.g. an amino acid in group T would 
interact preferably with a thymine), and recognize that 
only changes of amino acid group affect the binding prop¬ 
erties, we can use A, T, C, and G to represent letters from 
the quaternary alphabet for both proteins and DNA se¬ 
quences; for simplicity, this assumes that the amino acids 
are equally distributed amongst the four categories. In 
this way, the genome corresponding to this binding pro¬ 
tein - binding location pair consists of two ‘genes’ of 
length t in the standard four letter alphabet of DNA. 

For simplicity, we can then let the binding free en¬ 
ergy be proportional to the number of amino acid- 
DNA mismatches, equal to the Hamming distance r = 
duig^ ,9^), where the function dn counts the number 


of positions where the two sequences and g^ are not 
the same: 


AG = er, (1) 

where e is the energy scale for a given transcription fac¬ 
tor. This binding free energy corresponds to the specifi¬ 
cally bound mode of attachment (which has both specific 
and non-specific contributions) and is in thermodynamic 
competition with the non-specifically bound mode of at¬ 
tachment, which is purely electrostatic. The free energy 
of binding in the electrostatic non-specific mode is, 


AGns — tAEfi 


( 2 ) 


where is the effective increase in free energy per 

nucleotide in the non-specific mode relative to the best 
binder. Thermodynamic studies of Lac repressor bind¬ 
ing to DNA suggest that the difference in free energy 
between the best specific binding and the non-specific 
mode of binding is roughly ISfe^T, so as £ = 10 for the 
Lac respressor, we find AEna ~ l.SfcsT 42, As the 
number of mismatches increases between a DNA binding 
site and protein sequence, the probability of the non¬ 
specific mode of attachment increases, which we assume 
leads to a decrease in functionality of the site; for sim¬ 
plicity, we model this below using truncation selection 
with a critical number of mismatches r*. 


TF-DNA binding evolution 


Relating the binding energy of a TF to its binding site 
to the fitness of an organism is in principle very com¬ 
plicated. In general, we would expect that in order for 
a TF to find its binding site it would need to minimise 
the number of mismatches and so typically we might ex¬ 
pect that the fitness will increase with decreasing r. This 
is further supported by genome-wide studies of the dis¬ 
tribution of bindin g en ergies for different TFs in E.coli 
[i^ and yeast [b^. Ib^l. which show that there is a de¬ 
viation of this distribution from the random/neutral ex¬ 
pectation (EqnO below) for the best or lowest affinity 
binders. This deviation from the neutral distribution is 
related to selection for functional binding sites and has 
a character that suggests an effective (Malthusian) fit¬ 
ness landscape for binding energies, which is peaked with 
negative curvature. For simplicity, we therefore assume a 
quadratic log-fitness landscape, which is equivalent to a 
Gaussian fitness landscape (also referred to as a Wright- 
ian fitness landscape or sometimes as a Darwinian fitness 
landscape [6J). To model competition between the spe¬ 
cific and non-specific modes of attachment, we assume 
there is a critical number of mismatches r*, where the 
probability of binding in each mode is equal; this hap¬ 
pens when AG{r) = AGns, which from EqnlT]and Eqn[2] 
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gives r* = iAsns /This simply says that that as 
binding sites increase in length, the stability of the 
best binder (r = 0) relative to non-specific binding will 
increase in proportion to i and hence a larger number 
of mismatches will be required before a binding site be¬ 
comes non-functional. Specifically, for Ae = and 

Asns = lAksT [i^, we get the relation r* = if2. In 
the case of short DNA recognition sites for Eco RI en¬ 
donuclease cleaving DNA, where £ = 5, it was found that 
r* Ri 3 [5^ , which agrees well with our approximate rela¬ 
tion between r* and £. Thus we set the log-fitness to —00 
(Wrightian fitness to 0), for r > r* to model the situation 
where specific binding to the binding site of interest is no 
stronger than the non-specific mode of attachment: 


F(AG(r)) 


{ 


— ^Kpr'^ for r < r* 
—00 for r > r* 


(3) 


where up is the curvature of the fitness landscape and 
biologically, roughly corresponds to the strength of selec¬ 
tion of this trait; sls up decreases the fitness landscape 
becomes more shallow, and so for a fixed effective pop¬ 
ulation size the landscape becomes more neutral. Our 
choice of fitness function, essentially assumes that for 
AG — AG* > 0, the probability of the TF being bound 
to its binding site is zero; this is an approximation to 
the more correct functional form for the proportion of 
time the TF spends at its TFBS, which will be sigmoidal 
in form [43| with transition at the critical binding en¬ 
ergy AG* = Aer*. However, without a far more detailed 
model of how occupancy affects gene expression which af¬ 
fects fitness, which is beyond the scope of this work, we 
are left to choose some arbitrary occupancy threshold be¬ 
low which the organism is inviable; for simplicity, we have 
chosen AG*, as this threshold naturally corresponds to 
when specific and non-specific binding are equally like at 
the site. We expect our qualitative results to be robust to 
the choice of such a threshold. Similarly, a more detailed 
consideration would include binding of the TF to other 
spurious sites in the genome with large sequence similar¬ 
ity; again we expect such consideration will change the 
value of AG*, but not change the scaling relation r* oc i, 
as longer binding sites will always have a larger maximum 
affinity. 

To simulate the evolution of TF-TFBS sequence evo¬ 
lution we assume a diploid Wright-Fisher population ge¬ 
netic process with 2Ne copies of each gene in the popula¬ 
tion with a fixed effective population size of Ne- As we are 
in interested in the weak mutation regime {nfj,oNe 1), 
the simulations consist of a single fixed sequence for the 
TF-TFBS pair of loci at each time-point, where new mu¬ 
tations either fix or not, decided based on the probabil¬ 
ity of fixation; this process represents the evolution of 
a monomorphic population of effective size Ng. We as¬ 
sume full linkage disequilibrium within the TF and TFBS 
loci and linkage equilibrium between loci. In addition, 


we assume each loci is always homozygous on each lin¬ 
eage; when considering hybrid incompatibilities, the ini¬ 
tial product of any cross mating will be heterozygotic 
at all diverged alleles. All post-zygotic DMIs must be 
sufficiently deleterious to affect these heterozygotic off¬ 
spring. There may be many TF-TFBS pairs where the 
lack of cross binding in heterozygotes does not appre¬ 
ciably change the offspring’s viability. We will assume, 
however, that there are some TF-TFBS pairs that are 
sufficiently critical such that r > r* is sufficient to de¬ 
crease the gene expression level to the extent that the 
hybrid is inviable; these are the pairs that will be rele¬ 
vant for the speciation process, and therefore are the ones 
addressed by our model. 


We use the Gillespie algorithm [63 , to simulate evolu¬ 
tion as a continuous time Markov process; at each step of 
the simulation the rate of fixation of all one-step muta¬ 
tions from the currently fixed alleles (wildtype) on both 
TF and TFBS loci are calculated, and one of these muta¬ 
tions is selected randomly in proportion to their relative 
rate. Time is then progressed by K~^ ln(u), where K is 
the sum of the rates of all one-step mutants and u is a 
random number drawn independently between 0 and 1, 
which ensures the times at which substitutions occur is 
Poisson distributed, as we would be expected for a ran¬ 
dom substitution process. The rates are based upon the 
Kimura probability of fixation (^ : 


I-e- 2*^ ANeSF 

k — 2^oNe ^ _ ^-4N^SF ^ k-0 ^ 

where 5F is the change of fitness of a mutation at a par¬ 
ticular location and 2/ioAe is the rate at which muta¬ 
tions arise for each amino acid or nucleotide position in 
a diploid population, where we have assumed that the 
effective population size is the same as the number of in¬ 
dividuals N\ the latter approximation in EqnH] assumes 
5F 1. Note that although in the simulations we use 
the full form for the fixation probability, typically we 
would expect fitness effects to be small {5F <C 1), so the 
substitution rates only depends on the population-scaled 
fitness changes AN^SF which, for a given mutation, is 
proportional to ANgKp. In the rest of the paper we will 
refer to the scaled population size ANeKp to make it clear 
that either reducing Ne or kp (or both) can change the 
evolutionary outcomes from those dominated by selection 
to those dominated by drift. For each scaled population 
size and sequence length, 1000 replicates were run up to 
a time of = 500. In addition, simulations were ran 
up to a shorter time (dependent on the exact value of 
AKpNe) with 10® replicates in order to get reliable esti¬ 
mates of the very small probability of a DMI (FiglH]) at 
early times. 
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A biophysical model of reproductive isolation 

Using the above evolutionary process based on the bio¬ 
physics of a TF binding DNA, we study allopatric speci- 
ation by independently evolving two lineages in the fit¬ 
ness landscape defined by Eqn[21 We create an ancestral 
genome containing a protein and a DNA binding site 
gene, each of length £, with AG drawn from the equi¬ 
librium distribution of binding energies. This ancestral 
genome is then duplicated, with each copy representing 
the start of a different isolated population that subse¬ 
quently evolves independently. If the evolving protein 
and DNA sequences in one lineage are gf and gf and the 
other g 2 and , we can at each time point calculate the 
Hamming distance for each hybrid as hi 2 = dff(gf,g2*) 
and /i2i = dH{g 2 tQ?) with corresponding hybrid bind¬ 
ing energies, AG^ = Aehi 2 and AG|^ = Aeh 2 i. Using 
the same fitness function EqnjHl we can then evaluate 
the fitness of the hybrids as a function of time. An in¬ 
compatibility arises whenever the fitness of the hybrid is 
—oo {hi 2 > r*{£) or ^21 > i.e. when a hybrid TF- 

TFBS specific binding is weak compared the non-specific 
mode of binding and effectively can no longer recognise 
its target site. At this point, we assume that the two di¬ 
verging populations can no longer form viable offspring, 
and they are reproductively isolated. 

This model of TF-TFBS binding energies is inherently 
epistatic, despite the assumption that the contribution of 
each pair of interacting pair amino acid and nucleotide 
is independent and additive to the total binding energy. 
There is epistasis at both the phenotype and fitness level, 
the latter due to quadratic selection. Hence, although 
there is a similarity between our model and polygenic 
models of quantitative traits under quadratic selection, 
they are very different as in quantitative traits the phe¬ 
notype is additive in each loci Here in our model 

epistasis arises since at each location, say in the protein 
sequence, whether a given amino acid will give rise to a 
match or mismatch depends on the particular nucleotide 
that it is opposite; the binding energy phenotype is a non¬ 
linear function of the sequences at the TF and TFBS loci. 
It is this epistasis that is the source of the Dobzhansky- 
Muller incompatibilities that we find in our simulations 
described in the Results section. For example, the com¬ 
mon ancestor might be fixed for a pair of sequences 
^1^, which has a binding energy of AGqa = iksT, 
as there is only a single mismatch; after a period of di¬ 
vergence, two allopatric populations might be fixed for 
and each arising from just two substitu¬ 

tions, of compensatory effect, from the common ancestor 
sequence, so that AGi = AG2 = as there is still 

only a single mismatch. However, the hybrid sequences 
atcgc atagc ’ which correspond to binding en¬ 
ergies AGf^ = AG 21 = QksT, as they each have two 
mismatches. As the number of substitutions increases on 


each lineage, we can see that each lineage will maintain 
good fitness in a stabilizing landscape through compen¬ 
satory changes, which each try to minimize the number 
of mismatches; however, each lineage fixes different sets 
of compensatory mutations, so when combined in a hy¬ 
brid, the epistasis between pairs of sequences then gives 
rise to DMIs. 


RESULTS 

Evolution under stabilizing selection on each lineage 



FIG. 1. Equilibrium neutral distribution of binding energies 
AG for a 2-state model of TF-DNA binding with a quaternary 
alphabet for both amino acids and nucleotides. Solid black 
squares are KMC simulations of neutral evolution, where 
each sequence is of length ^ = 10 and the binding energy 
is AG = er, where r is the Hamming distance between the 
two sequences and e = SksT. We see that under neutral evo¬ 
lution the distribution U(AG(r)) is highly non-uniform. The 
solid line shows the distribution predicted by Eqn[Sl which is 
the relative number of sequences corresponding to a Hamming 
distance r = A®, 

e 

To understand the qualitative properties of TF-DNA 
binding evolution, we first consider neutral evolution of 
such a system and in particular, the resulting distribution 
of binding energies AG. The results of KMC simulations 
of neutral evolution of sequences g^ and g^ of length 
i = 10, and e = SksT^ with up = 0 and r* = 00 (where 
all sequences have equal fitness) are shown by the solid 
black squares in Fig[l] We see that even under neutral 
evolution of sequences, the distribution of binding ener¬ 
gies is non-uniform and roughly Gaussian with a peak 
between 22kBT and 23fesT. We can understand this by 
considering the many-to-one mapping often characteris¬ 
tic of the relationship of genotype to phenotype. In par¬ 
ticular, there will be a large set of sequences that result 
in the same binding energy. The number of sequences 
U(AG) that correspond to a given Hamming distance or 
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energy is non-uniform and given by the binomial distri¬ 
bution 


n{AG{r)) = 4 


21 


(5) 


where from Eqn[T]r = AG/e. For example, the number 
of sequences that give AG = 0 is fl{AG = 0) = 4^ rs 10® 
(for i = 10), as there is exactly one DNA sequence that 
matches to each one of the 4^ protein sequences. This 
number is very small compared to the number of se¬ 
quences that have 7 mismatches, n{AG = 21kBT) 

3 X 10^^, which is close to the mean of the distribution 
(AG) = 3e£/4: = 22.5A:bT. For neutral evolution, the 
resulting distribution of AG will match the number of 
sequences corresponding to each AG value. The distri¬ 
bution Q{AG) (normalized) expressed in FqnOis plotted 
as a solid line in FiglT] and we see excellent agreement. 
This effect of a non-uniform distribution of binding ener¬ 
gies of random se quen ces on evolutionary dynamics has 
been well studied 45-^, 6^ and measured empirically 
for various TFs in E. coli and yeast ; as we will 

this has a strong impact on the distribution of binding 
energies under selection at different population sizes. 



The distributions of binding energies resulting from 
evolution with the fitness function Fqn[3] at different 
scaled population sizes {AN^kf) are shown in Fig (21 for 
£ = 10 and r* = £/2 = 5. The distributions are con¬ 
fined to the region 0 < AG < AG*, where AG* = 
er* = ISfcsT, is the inviability boundary. Here, it is clear 
that fitness is not maximized, but instead there is a bal¬ 
ance between selection for higher fitness and the tendency 
to undergo drift towards those phenotypes which corre¬ 
spond to the largest number of sequences. For larger pop¬ 
ulation sizes, selection dominates, resulting in sequence 
pairs with high fitness. For smaller population sizes, 
stochastic effects due to drift are more important, result¬ 
ing in a shift to weaker (more positive) binding energies, 
approaching the neutral distribution as the population 
size decreases below the inverse of the overall difference 
in fitness on the landscape ^Kpier*)'^ . This results in a 
greater effective drift load for smaller population sizes. 

The binding energy distributions show that for a gen¬ 
eral genotype phenotype map fitness is not maximized, 
but instead there is a balance between selection for higher 
fitness and the tendency to undergo drift towards those 
phenotypes which correspond to the largest number of 
sequences. A powerful approach to dealing with this 
degeneracy is through the concept of sequence entropy 
[6 a l6^ , representing the (log) number of sequences en¬ 
coding a given phenotypic state (e.g. binding energy), 


S'(AG) = ln(0(AG)), (6) 

which is closely related to the Boltzmann entropy from 
statistical mechanics [g^. This entropy measure, should 
be distinguished from entropies of sequences due to poly¬ 
morphisms in the population (in this paper we have as¬ 
sumed populations are always monomorphic). The pre¬ 
cise combination of fitness and sequence entropy that is 
maximized during evolution is the function $ (AG ) = 
F{AG) + S{AG)/ANe, termed the free fitness [7^ ItH . 
from which the probability density is given by 


FIG. 2. Equilibrium distribution of binding energies AG as a 
result of evolution subject to the quadratic fitness landscape 
in Eqn[21 for £ = 10; the qualitative results for £ = {5, 20} are 
similar and not shown. The fitness landscape has a fitness 
cliff (inviability boundary) for r > r* = £/2 = 5 mismatches, 
or for binding energies greater than er* = ISfcsT, which rep¬ 
resents when the specific binding energy to its binding site 
is greater than the free energy of binding to the rest of the 
genome. The solid squares are results of KMC simulations, 
while the solid lines are the expected distribution from Eqn[71 
which we see agree very well. In addition, we see that the dis¬ 
tribution shifts from one dominated by fitness F{AG) at large 
population sizes {AkfN 3> 1) with a peak at the highest fit¬ 
ness binding energy to one dominated by sequence degeneracy 
at small population sizes {AkfN <C 1), which is peaked at the 
inviability boundary, representing the left tail of the neutral 
distribution in Fig[T] (shown in black). 


p(AG) = ie4^e<t(AG)_ (7) 

Zj 

where Z is a normalization factor, known as the partition 
function, given by Z = Y/Uo This probabil¬ 

ity density is plotted as solid lines in Figl2]for different 
population sizes, using Fans IhldtHl and 171 we see that the 
agreement between the two is excellent. 

This greater drift load is also illustrated in FiglU which 
shows the average binding energy and also the Hamming 
distance of the populations to the inviability boundary, 
as a function of the scaled population size AkfN, for 
sequence lengths £ = {5,10,20}; for the corresponding 
values of £, we choose r* = {3,5,10}, so as to approx¬ 
imately satisfy r* = £/2. We see the average binding 
energy (squares) is larger for smaller population sizes. 
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Scaled Population Size iKpNg 


FIG. 3. Average binding energy, (AG) = e{r), (left axis, 
squares) and average Hamming distance of populations from 
inviability boundary, r* — (r), (right axis, circles) as function 
of scaled population size AkfNs and sequence length i calcu¬ 
lated using KMC simulations. We see that as the population 
size is decreased the mean hamming distance or binding en¬ 
ergy (~ drift load) increases monotonically and towards the 
inviability boundary. 


which corresponds to populations being closer to the in¬ 
viability boundary as shown by the circles in Figl21 and 
hence also a larger drift load. For large population sizes 
{4:KpNe 3> 1), where fitness dominates, the drift load 
is zero, independent of Ne, as (AG) —?► 0. This means 
that, as shown in Figl31 the average Hamming distance 
to the inviability boundary increases for increasing se¬ 
quence length “ this arises trivially as r* (x £ - how¬ 
ever, for small population sizes {AnpNg <C 1) the average 
Hamming distance to the boundary is roughly indepen¬ 
dent of sequence length. To understand this we consider 
that for small populations the distribution is neutral and 
peaked at the inviability boundary r*{£), as shown in 
Figl2] and by the fact the mean binding energy is close 
to AG* = err*, for AkfN -C 1 in FigO at the inviabil¬ 
ity boundary the number of mutations that increase the 
Hamming distance is just the number of locations that 
are matched, multiplied by the number of nucleotides 
that can give a mismatch, 3(£ — r*{£)) = ?>£/2 and those 
that decrease it is just the number of mismatched loca¬ 
tions, r* = £/2. The ratio of these two quantities is in¬ 
dependent of £, showing that there is no net drift bias of 
the populations at the inviability boundary as £ changes 
and so for small populations the average distance to the 
inviability boundary is roughly independent of £. As we 
will see the initial distance of the common ancestor from 
the inviability boundary has a strong impact on the rate 
of accumulation of DMIs, as functions of population size 
and sequence length. 

We can also examine the population size dependence 
of the substitution rate per location in the stabilizing 
landscape defined by Eqn[31 as shown in FigH] by the 



FIG. 4. Average total substitution rate for both protein 
and DNA loci, on a single lineage as function of scaled 
population size 4 k_fA and sequence length £. Substitu¬ 
tion rate is plotted in units of the nucleotide mutation rate 
po- The solid circles represent KMC simulations, while the 
solid lines are the theoretical prediction of the average rate 
{k) = + 

where peir) is the equilibrium distribution of Hamming dis¬ 
tances (shown in Fig[2]) and 7r“ and tt’*' are the fixation proba¬ 
bilities for the transition r — >■ r —1 and r — >■ r-l-1, respectively. 


solid squares, obtained by simulation for £ = {5,10, 20}. 
We see that there is the same qualitative dependence 
on population size for each sequence length, which can 
be explained by the average size of fitness effects as the 
population size changes; at very large population sizes 
the distribution of binding energies is peaked at AG = 0 
and so the average substitution rate will be dominated 
by transitions between r = 0 and r = 1; forward transi¬ 
tions to r = 1 happen rarely since the population scaled 
difference in fitness, ANe,5F = —2kfN£^, will be nega¬ 
tive with magnitude much greater than 1 for AnpN ^ 1 
and so substitutions will occur significantly slower than 
neutral. While at very small populations although the 
inverse of the population size is much larger than differ¬ 
ences in fitness, since populations spend a large fraction 
of the time at the inviability boundary r*, the substi¬ 
tution rate is also diminished compared to the expected 
neutral rate ^o, since a fraction [£ — r*)l£ of mutations 
at this boundary are inviable and are never accepted in 
the population. It is interesting to note that this form 
of the population size-substitution rate relation is qual¬ 
itatively similar to what would be expected in a simple 
stabilizing landscape (t^ . however, here at small popu¬ 
lations, sequence degeneracy combined with drift pushes 
populations to the inviability boundary giving rise to an 
effective substitutional drag relative to the neutral rate. 

We find a non-trivial dependence of the substitution 
rate on sequence length; at large population sizes, as 
expected, the substitution rate per location is indepen- 





















dent of sequence length, but strongly diminished com¬ 
pared to the neutral rate /io, as discussed above, due 
to the discrete changes in fitness being larger than the 
inverse of the population size. For small populations, 
we also find that the substitution rate is roughly inde¬ 
pendent of sequence length; as the distribution of bind¬ 
ing energies is peaked at the inviability boundary the 
substitution rate will be proportional to the number 
of viable substitutions multiplied by the neutral rate, 
~ = /ro/2, which as observed in Figgis inde¬ 

pendent of 1. However, for intermediate population sizes, 
where 1 the average substitution rate decreases 

with increasing sequence length. In the large and small 
populations size limits, all substitutions are either non¬ 
neutral or neutral, respectively, for 0 < r < r*. However, 
for intermediate population sizes the quadratic fitness 
landscape means there is a critical Hamming distance, 
r*jy Ri below which substitutions are effec¬ 

tively neutral {4:Ne\SF\ <C 1) and above are non-neutral 
{4:Ne\SF\ ^1). The effective substitution rate will then 

be roughly ~ ^ ji, where a(€) = Er=o w(^) 

is the proportion of time, at equilibrium, spent in the 
nearly neutral region and ji is the fraction of nearly 
neutral substitutions at we expect that a(£) will 

decrease for increasing since we find that p(.{r) shifts 
to larger values of r as £ increases (not shown), due to 
an increased degeneracy pressure, as the sequence length 
is increased. So together with the fact that the fraction 
of nearly neutral mutations decreases for increasing 
like we see that the average substitution rate is 

smaller for larger sequence lengths at intermediate pop¬ 
ulation sizes {AkfNs = 1). 


Rate of accumulation of hybrid incompatibilities 

To study the speciation process, we perform replicate 
simulations of pairs of lineages using the KMC scheme 
outlined above, with fitness given by Eqn. O where each 
simulation starts with two identical sets of sequences with 
AG drawn from the equilibrium distribution of binding 
energies as shown in Figl21 We first plot the average hy¬ 
brid binding energy as a function of pot in FigO At 
zero divergence, the average hybrid binding energies are 
equal to the average binding energies for that population 
size, as shown in Figl^l For long divergence times, the 
hybrid binding becomes weaker, with the binding ener¬ 
gies increasing to a value AG^ = 22.bkBT, irrespective 
of population size, corresponding to the mean of the neu¬ 
tral distribution in Fig[TJ this is exactly what we would 
expect after a long period of divergence, as protein and 
DNA sequences from different lineages should have ef¬ 
fectively random interactions. The rate at which this 
neutral distribution is reached depends strongly on pop¬ 
ulation size in an approximately monotonic manner, as 
would be predicted from the average substitution rate 



FIG. 5. Average hybrid binding energy {AG^) as a function 
of time after divergence from common ancestor fiot for £ = 
10 - the qualitative results for I = {5, 20} are similar and 
not shown. The inset shows the root mean square deviation 
'^AG^ ~ \J ((AG-^ — (AG^))^) of hybrid binding energies as 
a function of divergence time. 


seen in FiglU The inset of FigO shows the root mean 
square, ctag" = V ((AG^ — (AG^))^) of hybrid bind¬ 
ing energies vs /rpt on a log-log scale; we see that in the 
limit of large population sizes that (t^qh ~ sug¬ 

gesting that the underlying dynamics of the hybrids is 
effectively diffusive, as suggested by more coarse-grained 
models 



FIG. 6. Average probability of a DMI as a function of time 
after divergence from common ancestor fiot calculated from 
KMC simulations for various scaled population sizes, for £ = 
10; the qualitative results for £ = {5, 20} are similar and not 
shown, but the trends with sequence length are demonstrated 
in FigEl 


The probability of DMIs Pi{t) as a function of pot 
is plotted in FigO for various values of AnpNe and 
for £ = 10. The qualitative behavior of the plots for 
i = {5, 20} are similar and for clarity not shown; however, 
we examine below the dependence of F/(t) on i through 
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the typical time for reproductive isolation to arise in 
FiglT] We see that the model predicts a very strong pop¬ 
ulation size effect for the dynamics of hybrid incompati¬ 
bilities; as the population size decreases the timescale for 
DMIs to arise sharply decreases. This effect saturates for 
very small population sizes, but diverges for very large 
population sizes, to the point that reproductive isolation 
will take extremely long times for very large population 
sizes {4:NeKp 3> 10). This trend can be understood to 
arise from two effects: 1) as the population size decreases, 
the initial drift load of the common ancestor is on average 
larger and so fewer substitutions are required between a 
pair of divergent lineages for an incompatibility to arise 
in a hybrid (shown by Figsl2K4Sl); 2) as the population 
size increases beyond AnpN ^ 1 the substitution rate 
on each lineage decreases significantly, as seen in FigHJ 
which increases the observed time for incompatibilities to 
arise. For small population sizes the increase in DMIs is 
quadratic at small times <C 1), while for large pop¬ 

ulation sizes there is a very rapid increase in DMIs, which 
does not seem to fit a power law and suggests a finite 
negative curvature on a log-log scale. We note that our 
prediction at small population sizes and times is the same 
as Orr’s 13, 2^, but the underlying mechanism in this 
2-loci system is very different as it arises from an aver¬ 
age over the equilibrium distribution of common ancestor 
binding energies (Figl^]). The behavior seen at large pop¬ 
ulation sizes is consistent with theoretical predictions of a 
coarse-grained model of TF-DNA binding evolution [68| . 
where the growth of DMIs is rapid with the asymptotic 
form, as t —>■ 0 of Pi(t) ^ erfc(I/-\/t) ~ which 

cannot be expressed as a power law for small times. This 
form arises when considering the distribution of times to 
diffuse to the incompatibility boundary starting from a 
fixed binding energy; these are the conditions found for 
KMC simulations at large population sizes, where hybrid 
binding energies show neutral diffusive dynamics (inset 
Fig|5]) and the equilibrium distribution for the common 
ancestor is highly peaked (Figl^]). Finally, we see that at 
the intermediate population size of 4,KpNf, = 1, there is 
a transition from the power-law behavior at short times 
and non-power law at long times, with the transition at 
approximately /iot ~ 0.1; this would be as expected if 
the short-time behavior arises from common ancestors 
drawn from the right-tail of the probability distribution, 
near the inviability boundary, for AnpNe = 1 in Figl21 
whilst the long-time behavior arises from common an¬ 
cestors drawn from around the peak of the distribution, 
which are further away from the boundary. 

In a full genome, where there are many possible inter¬ 
acting genes, it will typically be the short-time behavior 
of each interacting pair that will dominate. If we as¬ 
sume roughly m ^ 10 interaction partners per gene and 
nc ~ 2 X 10^ protein coding genes, we have roughly 
M = ^mna ~ 10® interaction partners. As only a single 
one of these interactions giving rise to a DMI is required 


for RI, we would expect the probability that RI has arisen 
is PRi{t) = 1 — {1 — Pi{t))^, which at short times is given 
by PRi{t) Ri 1 — e-MPiW ^ In FigQis plotted the time 
t* at which P/(t*) = 10“®, for £ = {5,10,20}. We see 
there is a strong population size dependence on the rate 
at which RI develops and a weaker, but still significant 
one on the sequence length. In particular, we see for 
small populations RI can arise quite quickly, on times 
where not* Ki 0.0005, for £ = 20, which corresponds to 
^ 250,000 generations, assuming no = 2. x 10“®. As 
discussed above a major determinant at large population 
sizes on the time for RI to develop is the rate of substi¬ 
tutions on each lineage, the inverse of which is plotted as 
a dashed line in FigH we see that although the inverse 
substitution rate is a good predictor for large popula¬ 
tion sizes, for small populations it fails. This is due to 
the larger drift load for smaller population sizes, which 
reduces t* further. 


We see that the rate of growth of DMIs and the time 
for RI to arise has a complicated dependence on the se¬ 
quence length £] for small populations sizes {AupNe 1), 
RI develops more quickly for longer sequences, whilst 
for intermediate and large population sizes (AKpNg > 1) 
this trend is reversed and longer sequences mean RI de¬ 
velops more slowly. The divergence rate of the two al- 
lopatric populations will be controlled by the total sub¬ 
stitution rate for both protein and DNA loci, which is 
2£{k); for small populations, this trend arises, trivially, 
from the fact that the per location substitution rate (k) 
is roughly independent of sequence length (as shown in 
FigH]), giving a higher rate of divergence for larger se¬ 
quences, together with the fact that the average number 
of substitutions needed to reach the inviability boundary 
r* (£) — (r) is independent of sequence length (as shown by 
Figl3]). For large population sizes, RI arises more slowly 
for larger sequences, despite the fact that, like at small 
population sizes, the overall divergence rate of the two al- 
lopatric populations is larger for longer sequences. If we 
assume that for large populations the dynamics of the 
hybrids is diffusive (as suggested by the inset of FigO, 
then the mean square Hamming distance should increase 
linearly with time (r^) ~ 2£{k)t [t^. We then would 
expect t* ~ ~ to increase linearly with £, as 

(k) is independent of £ (as shown in Fig[5]); the exact 
values of speciation times are t* = {37.5,72.1,158}, for 
£ = {5,10, 20}, so we see that at each doubling of £, t* is 
roughly doubled, lending support to the diffusive model, 
as well as explaining the trend of a longer t* for longer 
sequences in FigI7]for large populations. For intermedi¬ 
ate populations sizes {AKpNe = 1), we have the same, but 
stronger trend, which is due to the fact that the per loca¬ 
tion substitution rate (fc) is smaller for longer sequences, 
as shown in Figdland so giving a t* which grows faster 
than linear with respect to £. 
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FIG. 7. Time for reproductive isolation (RI) to arise as a 
function of scaled population size ‘inpN and sequence length 
I, defined as the time t* when the average probability of a 
DMI crosses a threshold value of 1/M = 10~®, where M is 
the typical number of interaction partners of a protein in a 
genome. The black dashed line corresponds to a plot of the 
inverse of the average substitution rate shown in FigU) 


DISCUSSION & CONCLUSIONS 


Dobzhansky, Muller & Bateson n provided the first 
solution to Darwin’s conundrum of how speciation might 
arise by suggesting that in allopatry incompatibilities 
form between co-evolving loci on an epistatic fitness land¬ 
scape. Many studies have since suggested that the dom¬ 
inant form of reproductive isolation involves the accu¬ 
mulation of Dobzhansky-Muller incompatibilities in ge¬ 
ographically isolated populations with no or little gene 
flow fill . [l3 | . The observation of the large diversity 
of species on small yo ung islands, such as Hawaii [1^, or 
on the island of Cuba [l5| and East African Great Lakes 
II [3 , where in the latter two cases each have been sub¬ 


ject to historically fluctuating water levels and thus op¬ 
portunities for allopatric speciation, suggest that smaller 
populations speciate more quickly. This is in contrast to 
lower levels of reproductive isolation observed in marine 
species with large ranges and population sizes; for ex¬ 
ample, the relatively small fraction of Pacific-Caribbean 
species pairs separated by the Isthmus of Panama a few 
million years ago compared to those that are not repro- 
ductively isolated [I1B0. There is also evidence thar 
reproductive isolation arises more slowly in birds com¬ 
pared to mammal s 11 911 . Strikingly, even after roughly 
55Myr divergence [23j |. domestic chickens {Gallus gallus) 
can still hybridize with helmeted guineafowl {Numida 
meleagris), where estimates of the effective population 
size of domestic chickens range between Ne ~ 10^ to 10® 
[ 2 ^ . whereas in contrast, cichlids develop reproductive 
isolation as quickly as 1 — lOMyr after divergence [ 2 ^ 
and have relatively small population sizes (100 — 10000 


01111 ). This population size trend is further supported 
by net rates of diversification inferred from phyloge¬ 
netic trees [251 I2(tI|. Although, the models of Gavrilets 
0, Nei |30l| and 3 , predict that very polygenic traits 
or those under high mutation rate will tend to show this 
population size trend, they predict no population size 
dependence, or are not applicable in the weak mutation 
regime {nfioN 1). In particular, data on the genetic 
nature of species differences, suggest many traits involved 
are oligogenic, involving only a few loci |41| and so it is an 
open question to explain the population size dependence 
of speciation for such monomorphically evolving traits. 

Here, we have developed a biophysically motivated 
model of how incompatibilities arise in allopatric popu¬ 
lations, using a simple model of the co-evolution of tran¬ 
scription factors binding to DNA in the weak mutation, 
monomorphic regime. A key aspect which this biophysi¬ 
cal model of evolution introduces to the picture of fitness 
landscapes is the idea that many sequences can result 
in the same phenotype, that is the number of sequences 
corresponding to each phenotype can be very different, 
and this uneven distribution can have important conse¬ 
quences for the evolutionary process. As described, our 
results arise due to a drift-selection balance, which can 
be cast in the language of a balance between fitness and 
sequence entropy. The maximum of the free fitness land¬ 
scape, corresponds to the phenotype when these two evo¬ 
lutionary forces are balanced; importantly, this balance 
is dependent on the population size. Here, for TF-DNA 
binding there are many more sequences that have a large 
number of mismatches compared to those few high fit¬ 
ness sequences that have a small number of mismatches; 
at smaller population sizes genetic drift dominates push¬ 
ing the equilibrium towards less fit sequences. This has 
an important consequence for the dynamics of reproduc¬ 
tive isolation, that smaller scaled populations on average 
have common ancestors with a larger drift load and so 
a smaller number of substitutions are needed for an in¬ 
compatibility to arise in hybrids. This leads to the main 
prediction of the paper that smaller scaled populations 
{4:KfN(, 1) develop incompatibilities more quickly. At 

larger scaled population sizes {AKpNg » 1, but still in 
the weak mutation regime, nfioN <C 1), where fitness 
dominates drift we And this trend continues, but for a 
different reason; when 3> 1 populations no longer 

diverge neutrally and instead need to fix deleterious mu¬ 
tants whose difference in fitness is large compared to the 
inverse of the effective population size. This means that 
the time for reproductive isolation becomes very long for 
very large populations. Note however, that although our 
theory strictly applies to the monomorphic regime, we 
also expect the effect of sequence degeneracy/entropy to 
lead to a similar trend of an increasing rate of repro¬ 
ductive isolation for decreasing scaled population size for 
polymorphic loci, where in addition the effect would be 
reinforced by the slowed divergence of allopatric lineages 
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due to the mechanism of Gavrilets and Nei [s^. In 
particular, recent work with a similar sequence based 
model, but in the regime where the effect of mutations 
will be strong, showed that smaller populations are more 
likely to develop poor hybrid Htness, however, no mech¬ 
anistic cause is given in their work for this trend and the 
dynamics of the accumulation of DMIs was not investi¬ 
gated. 

We also investigated the effect of sequence length 
on the rate of developing reproductive isolation. We 
find that TF-DNA binding with a larger number of nu¬ 
cleotides results in reproductive isolation arising more 
rapidly for small populations 4,KpNe -C 1, but less rapid 
at intermediate and large populations {AnpNe > 1). For 
small populations, we find the average Hamming distance 
to the inviability boundary and the average substitution 
rate are independent of sequence length and so repro¬ 
ductive isolation develops more rapidly because longer 
binding sites have a larger overall substitution rate and 
so the two allopatric lineages divergence more quickly. 
Conversely, when 4:KpNe > 1, despite the same depen¬ 
dence of the average substitution rate on sequence length, 
longer binding sites are more stable and so require a 
larger number of mismatches to destabilize the TF to 
prevent binding to its correct site, we model this simply 
by having an inviability boundary r* (x £. Guided by 
our simulation results (inset FiglS]), as well as theoretical 
studies [i^, which suggest that the hybrid binding en¬ 
ergies are diffusive, this then suggests that the time for 
reproductive isolation to arise should grow linearly with 
sequence length, which we find is in good agreement with 
our simulations. 

Our model then provides a rationale for the observa¬ 
tion in the field that smaller populations develop DMIs 
more quickly, with a robust mechanism that does not re¬ 
quire that either lineage pass through a fitness valley. It 
also, for the first time, provides an insight, through a bio¬ 
physical model, of the mechanistic causes of how DMIs 
develop for co-evolved pair-wise molecular interactions. 
While we would not expect quantitative agreement with 
biological systems, we can make a rough comparison to 
empirical data: our results suggest that reproductive iso¬ 
lation can occur on a timescale of order a few hundred 
thousand generations for small scaled population sizes. 
Direct studies of interspecific hybrids of African cichlids 
[ 2 ^ show that post-zygotic isolation typically arises over 
a timescale of ~ 4 — 18Myr, which corresponds to roughly 
~ 1—6 million generations, assuming a generation time of 
3 years [zi, which suggests the mechanism we present is 
consistent with empirical data. Importantly, we see that 
this mechanism can provide relatively rapid reproduc¬ 
tive isolation between lineages with only nearly neutral 
evolution, without having to invoke positive selection or 
peak-shifts. 

The model studied, however, is simplified compared to 
the complexity of gene regulation in eukaryotes with mul¬ 


tiple TFs binding to enhancers to control gene transcrip¬ 
tion and each TF having multiple binding sites control¬ 
ling many different genes. Here, we treat TFs and their 
binding sites on an equal footing and so for example, the 
substitution rate in each is the same. It is commonly 
thought that since TFs are under stronger pleiotropic 
constraints, they evolve more slowly and so much of the 
phenotypic divergence between species is driven by cis- 
regulatory change (and reviewed recently by (au¬ 

thor?) [76|). We expect that as pleiotropy will act to 
reduce the substitution rate on a TF, the divergence rate 
of allopatric lineages will decrease. This suggests that if 
pleiotropy is important, our simulations may underesti¬ 
mate the average time to reproductive isolation. How¬ 
ever, there is increasing evidence that protein evolution 
driven by protein-protein interaction together, for exam¬ 
ple, with tissue specific TFs can reduce the pleiotropic 
constraints on TFs 7611 . 


predicts that 


Previous theoretical work by Orr [27l . 
in the weak mutation regime, the number of incompati¬ 
bilities should increase as ^ from a fixed common an¬ 
cestor, due to the combinatorial possibilities over a large 
number of pair-wise interacting loci. Here, we predict 
the same growth of DMIs with time, but only for small 
scaled population sizes {AkpN <C 1) and for a single 2- 
loci system. However, the underlying mechanism appears 
to be very different here; the quadratic law arises due 
to averaging over the distribution of the initial binding 
energy (or effective drift load) of the common ancestor, 
which is roughly equivalent to averaging over the growth 
of DMIs for the different initial drift loads that each pair 
of loci will have across the whole genome within a single 
common ancestor. On the other hand, for large pop¬ 
ulations, which have a peaked distribution of common 
ancestors relative to the Hamming distance to the invia¬ 
bility threshold r *, we observe that the growth of DMIs 
does not appear to be described by a simple power law, 
but instead the results suggest there is a negative curva¬ 
ture to their growth on a log-log plot. In addition, we 
find that the variance of binding energies increases lin¬ 
early with time in the limit of large populations (inset 
Fig.SI), so together with our results that indicate t* ~ i, 
this suggests that from a given common ancestor the hy¬ 
brid binding energies follow neutral diffusive dynamics. 
Together, this is as predicted by a simple calculation of 
the growth of DMIs due to a continuous diffusion model 
for the evolution of TF-DNA binding and arises due 
to the fact that from a fixed common ancestor there is a 
finite mutational distance that needs to be diffused by hy¬ 
brids before incompatibilities can arise; in the low scaled 
population size limit this behavior turns into a power law 
when averaged over a broad distribution of common an¬ 
cestors. We suggest that more detailed studies of species 
divergence, similar to current works 77 , ^ , which show 
a rapid increase in DMIs, should be able to discern be¬ 
tween these two qualitatively different behaviors at dif- 
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ferent population sizes. In particular, recent cross-species 
ChiP-seq analysis of transcription factor binding (79| sug¬ 
gests a way to explicitly test our predictions at the level 
of actual binding affinities of hybrid TF-TFBS combina¬ 
tions for recently diverged species, such in the Drosophila 
family. 

The process of speciation underlies the vast diversity 
of life on Earth. We expect these results to be also seen 
in more complex models of co-evolving loci since the bal¬ 
ance between sequence entropy and fitness poising popu¬ 
lations nearer or further away from incompatible regions 
in a population size dependent manner is likely to be 
general. Gene expression divergence is thought to un¬ 


derlie many differences between species |50l - l52| , for ex¬ 


ample, in the Galapagos finches [80|, the various species 
of Drosophila and with more direct evidence of a 
role in speciation through the evolution of genes related 
to transcription factors [H, . More recently studies 

of crosses between D. melanogaster and D. santomea, 
which diverged more than 10 million years ago, have re¬ 
vealed how the cryptic divergence of genetic architecture 
of conserved developmental body plans leads to postzy- 
gotic isolation [sij. Proteins binding to DNA to control 
gene expression is a prototypical co-evolving system and 
critical for the proper development of organisms, thus 
these results have strong implications for speciation rates 
and diversity of populations at small population sizes. In 
addition, although our model is motivated by DNA pro¬ 
tein binding, the approach could be adapted to any type 
of interacting macromolecules, for example, co-evolution 
of protein-protein interactions or the interaction of genes 
expressed by nucleus and mitochondria, where in partic¬ 
ular, such interactions have been shown in yeast to give 
rise to cytonuclear incompatibilities 82, 8^. 
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